Genome de contamination Analysis

This is an R Markdown Notebook for de_contamination analysis in genome assembly projects. Here, I explain how to run two different de-contamination programs (kraken2 and blobtools).

Kraken2

What is kraken2?

Is “a taxonomic sequence classifier that assigns taxonomic labels to DNA sequences. Kraken examines the k-mers within a query sequence and uses the information within those k-mers to query a database. That database maps k-mers to the lowest common ancestor (LCA) of all genomes known to contain a given k-mer”. (https://github.com/DerrickWood/kraken2/wiki/Manual#special-databases)

To run kraken2 you need your genome assembly and the index fai file for the assembly.

# /bin/sh
# ----------------Parameters---------------------- #
#$ -S /bin/sh
#$ -q mThC.q
#$ -pe mthread 12 -l mres=48G,h_data=4G,h_vmem=4G
#$ -cwd                                                      
#$ -j y
#$ -N kraken
#$ -o kraken.log
#$ -m bea
#$ -M ariasc@si.edu
#
# ----------------Modules------------------------- #
module load bioinformatics/kraken
#                                                       
# ----------------Your Commands------------------- #
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS
#
kraken2 --db /data/genomics/db/Kraken/kraken2_db/ --report kraken_report_remi --use-names --confidence --threads $NSLOTS /scratch/genomics/ariasc/ariasc_pool/krake2/HPA_04_pilon.fasta
#
echo = `date` job $JOB_NAME done

Kraken Output

Holocanthus passer

  • 486 sequences (583.53 Mbp) processed in 295.263s (0.1 Kseq/m, 118.58 Mbp/m).
  • 413 sequences classified (84.98%) 73 sequences unclassified (15.02%)

Exploring kmers distribution for classified sequences.

First we will get only the list of contigs and its classification

grep "Contig" kraken2.log > kraken2_results.txt

Get list of human and unclassified contigs: By doing this, we are removing anything classified as bacteria, plasmids or viruses. We keep the unclassified because they might contain real contigs.

grep "unclassified\|Homo sapiens" kraken2_results.txt | cut -f2 > assembly.fa.HsU.list

In order to explore kraken resuls, I download the results files and I use R to explore.

module load bio/rclone
rclone copy kraken2_results.txt db3:hydra_backup/

Exploring the results in R.

kraken_results<-read.delim("kraken2_results_remi.txt", header = F)
colnames(kraken_results) <-  c("Classification", "contig", "taxon", "length", "kmers")

sub_kraken_results <- filter(kraken_results, taxon !="Homo sapiens (taxid 9606)" & taxon !="unclassified (taxid 0)")

sub_kraken_results %>% kbl() %>% kable_styling()
Classification contig taxon length kmers
C ctg307_pilon_pilon cellular organisms (taxid 131567) 20487 0:8871 228410:3 0:1568 9606:5 0:7534 530564:5 0:2467
C ctg322_pilon_pilon Dickeya dadantii (taxid 204038) 20351 0:1233 274537:2 0:4898 204038:1 0:3846 204038:3 0:3226 204038:3 0:2931 204038:1 0:4173
C ctg350_pilon_pilon Cellulosilyticum lentocellum DSM 5427 (taxid 642492) 12701 0:169 9606:1 0:11990 642492:4 0:503
C ctg373_pilon_pilon Campylobacter lari (taxid 201) 10068 0:1440 201:5 0:8589
C ctg425_pilon_pilon Mycolicibacterium rhodesiae NBB3 (taxid 710685) 6563 0:951 710685:5 0:5573
C ctg429_pilon_pilon Legionella pneumophila (taxid 446) 6166 0:1766 446:5 0:4361
C ctg449_pilon_pilon Bartonella birtlesii IBS 325 (taxid 1095900) 5579 0:3273 1095900:5 0:2267
C ctg458_pilon_pilon Hymenobacter sedentarius (taxid 1411621) 3868 0:690 1411621:2 0:3142
C ctg481_pilon_pilon cellular organisms (taxid 131567) 9386 0:2913 1484116:5 0:143 9606:5 0:6286
C ctg480_pilon_pilon Burkholderia diffusa (taxid 488732) 5053 0:704 488732:5 0:4310
#sum(sub_kraken_results$length)
#100222 ~100Kbp

Kraken2 found ten contigs contaminated with bacteria and archea. This 10 contigs made 100kbp.

Filtering scaffolds with contamination

Using samtools, we will extract from the assembly only the contigs identified as Human or unclassified. Bacteria, viruses and plasmids will be excluded.

module load bioinformatics/samtools
xargs samtools faidx original_assembly.fa < assembly.fa.HsU.list > new_assembly.fa

Blobtools

What is blobtools?

Is “A modular command-line solution for visualisation, quality control and taxonomic partitioning of genome dataset” (https://github.com/DRL/blobtools)

Preparing files for blobtools

First, you need to blast your assembly to know nt databases. for this we will used blastn program. we will used our job file blast.job to the program in the server.

# /bin/sh
# ----------------Parameters---------------------- #        
#$  -S /bin/sh      
#$ -pe mthread 12       
#$ -q mThC.q                                                  
#$ -l mres=96G,h_data=8G,h_vmem=8G       
#$ -cwd
#$ -j y                             
#$ -N acti_blast    
#$ -o acti_blast.log      
#$ -m bea 
#$ -M ariasc@si.edu                                 
#     
# ----------------Modules------------------------- #
module load bioinformatics/blast        
#     
# ----------------Your Commands------------------- #      
#  
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
echo + NSLOTS = $NSLOTS 
blastn -db /data/genomics/db/ncbi/db/latest_v4/nt/nt -query HPA_04_pilon.fasta -outfmt "6 qseqid staxids bitscore std" -max_target_seqs 10 -max_hsps 1 -evalue 1e-20 -num_threads $NSLOTS -out HPA_blast.out
#                               
echo = `date` job $JOB_NAME done

Second, you need to map raw reads to the genome assembly. We used bwa to map and samtools to convert from sam to ban and finally sort the bam file.

#Index genome assembly
bwa index -p HPA_04_pilon.fasta

#run bwa
bwa mem -t $NSLOTS reads/HPA_04_R1.fastq reads/HPA_04_R2.fastq > HPA_04_pilon_bwa.sam
#comvert to bam and sort with samtools
samtools view -bu -@ $NSLOTS HPA_04_pilon_bwa.sam | samtools sort -T HPA_04 -O bam -@ $NSLOTS -l 9 -m 2G -o HPA_04_pilon_bwa_aligned_sorted.bam

Running blobtools

First, you need to create a database with blast and mapping files

# /bin/sh                     
# ----------------Parameters------------#
#$ -S /bin/sh 
#$ -q mThM.q    
#$ -l mres=24G,h_data=24G,h_vmem=24G,himem
#$ -cwd 
#$ -j y 
#$ -N blob_create  
#$ -o blob_create.log   
#$ -m bea 
#$ -M ariasc@si.edu
#              
# ----------------Modules------------ #
module load bioinformatics/blobtools
#
# ----------------Your_Commands---- #
#
echo + `date` job $JOB_NAME started in $QUEUE with jobID=$JOB_ID on $HOSTNAME
#                                         
blobtools create \                        
  -i HPA_04_pilon.fasta \                 
  -t HPA_blast.out \                      
  -b HPA_04_pilon_bwa_aligned_sorted.bam \
  -o HPA_04_blob                          
#                                         
echo = `date` job $JOB_NAME done

Second to view/summarize results you need to process the .json file or you can directly plot the results.

#you can used the interactive node in hydra since it do not uses to much memmory or time
qrsh
#to view results in as a table.
blobtools view -i HPA_04_blob.blobDB.json -o ./
#to plot results
blobtools plot -i HPA_04_blob.blobDB.json -o ./

This created two png files with the plots. Look at file in project folder.

knitr::include_graphics('HPA_04_blob.blobDB.json.bestsum.phylum.p7.span.100.blobplot.bam0.png')

knitr::include_graphics('HPA_04_blob.blobDB.json.bestsum.phylum.p7.span.100.blobplot.read_cov.bam0.png')

Bloobtools2

qrsh -pe mthread 8
module load bio/blobtools/2.6.3 
#Create a BlobDir
blobtools create --fasta /scratch/genomics/ariasc/remy/blob/HPA_04_pilon.fasta HPA_04_pilon

#Add BLAST hits
blobtools add --threads 8 --hits /scratch/genomics/ariasc/remy/blob/HPA_blast2.out --taxrule bestsumorder --taxdump /scratch/genomics/ariasc/remy/blob/blopbtools2_run/taxdump HPA_04_pilon

#Add mapping coverage
blobtools add --threads 8 --cov /scratch/genomics/ariasc/remy/blob/HPA_04_pilon_bwa_aligned_sorted.bam HPA_04_pilon

#Download db HPA_04_pilon folder to desktop and run blobtools view 

rclone copy HPA_04_pilon db3:hydra_backup/

#Create interactive html page with the blobtools results. this command was run in my personal mac computer were I install blobtools2 using conda.
#activate conda enviroment for blobtools2
conda activate btk_env
./blobtools view --remote /Users/solracarias/Desktop/HPA_04_pilon

You can play with viewer to change colors on figures, tables etc. Figures and tables can be downloaded and visualized.

#
#![Blob plot.](HPA_04_pilon.blob.circle.png)
knitr::include_graphics('HPA_04_pilon.blob.circle.png')

Blobtools Results

  • Total 486 (# of contigs),580Mb (sum length), 5.7Mb (N50)
  • Chordata 396 (# of contigs),580Mb (sum length), 5.7Mb (N50)
  • no-hit 90 (# of contigs),690kb (sum length), 9.1kb (N50)
#![Blob plot.](HPA_04_pilon.blob.circle.png)
knitr::include_graphics('HPA_04_pilon.cumulative.png')

#![Blob plot.](HPA_04_pilon.blob.circle.png)
knitr::include_graphics('HPA_04_pilon.snail.png')

blob_results<- read.csv("bestsumororder_table.csv",header = T, sep = ",")
blob_results<-select(blob_results, -1)
blob_results_nohits <- blob_results %>% filter(bestsumorder_phylum=="no-hit")

Kraken_blob_results<- merge(blob_results, kraken_results, by.x ="id", by.y = "contig",  all.x=F, all.y=F)
filt_kraken_blob_results <- filter(Kraken_blob_results, taxon !="Homo sapiens (taxid 9606)" & taxon !="unclassified (taxid 0)")
filt_kraken_blob_results %>% kbl() %>% kable_styling()
id X_id gc length.x bestsumorder_phylum bestsumorder_class bestsumorder_order bestsumorder_family bestsumorder_genus bestsumorder_species HPA_04_pilon_bwa_aligned_sorted_cov Classification taxon length.y kmers
ctg307_pilon_pilon 304 0.3908 20487 Chordata Actinopteri Blenniiformes Blenniidae Salarias Salarias fasciatus 241.0585 C cellular organisms (taxid 131567) 20487 0:8871 228410:3 0:1568 9606:5 0:7534 530564:5 0:2467
ctg322_pilon_pilon 323 0.4026 20351 Chordata Actinopteri Spariformes Sparidae Sparus Sparus aurata 129.5457 C Dickeya dadantii (taxid 204038) 20351 0:1233 274537:2 0:4898 204038:1 0:3846 204038:3 0:3226 204038:3 0:2931 204038:1 0:4173
ctg350_pilon_pilon 352 0.4053 12701 Chordata Actinopteri Perciformes Percidae Lateolabrax Lateolabrax maculatus 360.8253 C Cellulosilyticum lentocellum DSM 5427 (taxid 642492) 12701 0:169 9606:1 0:11990 642492:4 0:503
ctg373_pilon_pilon 368 0.3631 10068 Chordata Actinopteri Actinopteri-undef Pomacentridae Amphiprion Amphiprion ocellaris 155.7172 C Campylobacter lari (taxid 201) 10068 0:1440 201:5 0:8589
ctg425_pilon_pilon 418 0.4435 6563 Chordata Actinopteri Spariformes Sparidae Sparus Sparus aurata 184.2904 C Mycolicibacterium rhodesiae NBB3 (taxid 710685) 6563 0:951 710685:5 0:5573
ctg429_pilon_pilon 430 0.3800 6166 Chordata Actinopteri Pempheriformes Lateolabracidae Lateolabrax Lateolabrax maculatus 143.1620 C Legionella pneumophila (taxid 446) 6166 0:1766 446:5 0:4361
ctg449_pilon_pilon 449 0.3970 5579 Chordata Actinopteri Spariformes Sparidae Sparus Sparus aurata 74.8993 C Bartonella birtlesii IBS 325 (taxid 1095900) 5579 0:3273 1095900:5 0:2267
ctg458_pilon_pilon 450 0.4196 3868 Chordata Actinopteri Pempheriformes Lateolabracidae Lateolabrax Lateolabrax maculatus 0.0000 C Hymenobacter sedentarius (taxid 1411621) 3868 0:690 1411621:2 0:3142
ctg480_pilon_pilon 484 0.3725 5053 no-hit no-hit no-hit no-hit no-hit no-hit 2498.0321 C Burkholderia diffusa (taxid 488732) 5053 0:704 488732:5 0:4310
ctg481_pilon_pilon 478 0.4563 9386 Chordata Actinopteri Holocentriformes Holocentridae Myripristis Myripristis murdjan 157.8065 C cellular organisms (taxid 131567) 9386 0:2913 1484116:5 0:143 9606:5 0:6286